Generalized Biwords for Bitext Compression and Translation Spotting: Extended Abstract
Authors
Abstract
The increasing availability of large collections of bilingual parallel corpora has fostered the development of natural-language processing applications that address bilingual tasks, such as corpus-based machine translation, the automatic extraction of bilingual lexicons, and translation spotting [Simard, 2003]. A bilingual parallel corpus, or bitext, is a textual collection that contains pairs of documents which are translations of one another. In the words of Melamed [2001, p. 1], “bitexts are one of the richest sources of linguistic knowledge because the translation of a text into another language can be viewed as a detailed annotation of what that text means”. Large bitexts are usually available in a compressed form in order to reduce storage requirements, to improve access times [Ziviani et al., 2000], and to increase the efficiency of transmission. However, compressing the two texts of a bitext independently is clearly far from efficient because the information contained in both texts is redundant. Previous work [Nevill-Manning and Bell, 1992; Conley and Klein, 2008; Martínez-Prieto et al., 2009; Adiego et al., 2009; 2010] has shown that bitexts can be compressed more efficiently if the fact that the two texts are mutual translations is exploited. Martínez-Prieto et al. [2009], and Adiego and his colleagues [2009; 2010] propose the use of biwords —pairs of words, each one from a different text, with a high probability of co-occurrence— as input units for the compression of bitexts. This means that a biword-based intermediate representation of the bitext is obtained by exploiting alignments, and unaligned words are encoded as pairs in which one component is the empty string. Significant spatial savings are achieved with this technique [Martínez-Prieto et al., 2009], although the compression of biword sequences requires larger dictionaries than traditional text compression methods.
The biword-based compression approach works as a simple processing pipeline consisting of two stages (see Figure 1). After a text alignment has been obtained without pre-existing linguistic resources, the first stage transforms the bitext into a biword sequence. The second stage then compresses this sequence. Decompression works in reverse order: the biword sequence representing the bitext is first recovered, and the two texts of the bitext are then reconstructed from it.
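The two-stage pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy bitext and word alignment are hypothetical, and zlib stands in as a generic byte-oriented compressor for the second stage. Aligned word pairs become biwords, and unaligned words are paired with the empty string, as described in the abstract.

```python
import zlib

# Toy aligned bitext (hypothetical example data, not from the paper).
# The alignment is a list of (source_index, target_index) pairs;
# "big" is deliberately left unaligned.
src = ["the", "big", "white", "house"]
tgt = ["la", "casa", "blanca"]
alignment = [(0, 0), (2, 2), (3, 1)]  # the-la, white-blanca, house-casa

def to_biwords(src, tgt, alignment):
    """Stage 1: transform the bitext into a biword sequence.

    Aligned word pairs become biwords; an unaligned word is encoded
    as a pair whose other component is the empty string.
    """
    aligned_src = {s for s, _ in alignment}
    aligned_tgt = {t for _, t in alignment}
    biwords = [(src[s], tgt[t]) for s, t in alignment]
    biwords += [(src[i], "") for i in range(len(src)) if i not in aligned_src]
    biwords += [("", tgt[j]) for j in range(len(tgt)) if j not in aligned_tgt]
    return biwords

def compress(biwords):
    """Stage 2: serialise the biword sequence and compress the result
    with a generic compressor (zlib here, purely for illustration)."""
    payload = "\n".join(f"{s}\t{t}" for s, t in biwords).encode("utf-8")
    return zlib.compress(payload)

def decompress(blob):
    """Reverse pipeline: decompress first, then recover the biword
    sequence, from which the two texts can be reconstructed."""
    lines = zlib.decompress(blob).decode("utf-8").split("\n")
    return [tuple(line.split("\t")) for line in lines]

biwords = to_biwords(src, tgt, alignment)
assert decompress(compress(biwords)) == biwords  # lossless round trip
```

A real biword compressor would assign codewords to biwords from a shared dictionary rather than serialising them as text, but the sketch shows the pipeline shape: align, transform to biwords, compress, and reverse the steps to decompress.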
Similar resources
Generalized Biwords for Bitext Compression and Translation Spotting
Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords —pairs of parallel words with a high probability of cooccurrence— that can be used as an intermediat...
Harnessing the Redundant Results of Translation Spotting
Translation spotting consists in automatically identifying the translations of a user query inside a bitext. This task, when it relies solely on statistical word alignment algorithms, fails to achieve excellent results. In this paper, we show that identifying the translations of a query during a first translation spotting stage provides relevant information that can be used in a second stage to...
Boosting Bitext Compression
Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that when modelling bitexts one can take advantage of the fact that there exists a relation between both texts; the text alignment task allows such a relationship to be established. In this paper we propose different approaches that use words and biwords (pairs made of two words, each ...
A Two-Level Structure for Compressing Aligned Bitexts
A bitext, or bilingual parallel corpus, consists of two texts, each one in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they are used as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our...
An Attentional Model for Speech Translation Without Transcription
For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phoneto-word alignment and translation reranking tasks, ...
Publication date: 2013